What filtering thresholds did you choose and how did you decide on them? Upon consulting the original publication, I decided to filter out cells with fewer than 500 expressed genes, cells with fewer than 800 UMIs, and cells with mitochondrial percentage greater than 10% for the sake of reproduction. These are pretty permissive thresholds as they keep in the outliers on the higher end of the expressed genes and UMI distributions which I probably would have filtered out. However, tt makes sense to use consistent filtering threshold across all samples which is another reason I decided to replicate this reasoning from the original publication.
How many cells / genes are present before and after implementing your filtering thresholds?
load("qc_summary.RData")
qc_summary <- rbind(
qc_summary,
data.frame(
Sample = "Total",
Cells_Before = sum(qc_summary$Cells_Before, na.rm = TRUE),
Genes_Before = sum(qc_summary$Genes_Before, na.rm = TRUE),
Cells_After = sum(qc_summary$Cells_After, na.rm = TRUE),
Genes_After = sum(qc_summary$Genes_After, na.rm = TRUE)
)
)
qc_summary
## Sample Cells_Before Genes_Before Cells_After Genes_After
## 1 HB17_background 36223 21238 1683 21238
## 2 HB17_PDX 70643 26492 67005 26492
## 3 HB17_tumor 73238 26552 67219 26552
## 4 HB30_PDX 86723 26660 22965 26660
## 5 HB30_tumor 85180 27350 28727 27350
## 6 HB53_background 79374 26247 47873 26247
## 7 HB53_tumor 85108 27456 27791 27456
## 8 Total 516489 181995 263263 181995
## 9 Total 1032978 363990 526526 363990
## 10 Total 2065956 727980 1053052 727980
There were a total of 1,032,978 cells and 363,990 genes across all samples before filtering and doublet removal. After filtering there were 526,526 cells and 363,990 genes. The individual sample before and after filtering numbers can be seen in the table above.
Look in the literature, what are some potential strategies to set thresholds that don’t rely on visual inspection of plots? There is MAD to identify outliers in QC metrics like number of genes, UMIs, or percent mitochondrial reads. This method uses Median Absolute Deviation and captures the central distribution while removing statistical outliers.Tools like CellBender which is a deep-learning based tool, automatically identify ambient RNA, empty droplets, and technical artifacts.
### Integrated Violin
### Doublet
Detection
##Normalization I am using Seurat’s default global-scaling normalization method, which normalizes the gene expression measurements for each cell by the total expression, multiplies by a scale factor (10,000 by default), and then log-transforms the result. This procedure helps account for differences in sequencing depth across cells.
Seurat “vst” method was used with standard options which selects the top 2,000 highly variable features So for my downstream analysis, I am including 2000 highly variable genes and excluding (28992 - 2000) genes. For a dataset this size, choosing 2000 highly variable features is reasonable. Seurat’s default method doesn’t use an absolute threshold (like a fixed variance value). It ranks all genes by their standardized variance (after accounting for mean expression), and then selects the top n. In my case, I went with Seurats standard top n of 2000.
## PCA
Based on the elbow plot below I decided to go with 20 PCs for the
downstream analysis. ### PCA Figures
I used the standard Seurat settings to Find Neighbors and Find Clusters. I used 20 PCS and a resolution of 0.5 to get THIS MANY clusters. How many cells come from each sample individually? How many total cells present in the entire dataset? Based on the UMAP plots I created, I decided that integration is needed. When labeling by sample ID, the data was clustering on sample and so not well integrated. It made sense that the data was clustering a lot on type but for sample the data should not have been splitting on it.
###Clustering and Visualization plots
I used Harmony to perform the integration. After running the UMAP again, I saw that though the sample still drove some of the separation there was a lot more overlap which made me feel better about the biological significance of the clusters and their annotations going forward.
###Integration Plots
Briefly describe the method performed to identify marker genes. Discuss a few advantages and disadvantages of the method used to perform marker gene analysis.
I used Seurats FindAllMarkers to indentify marker genes. The function identifies marker genes by performing differential expression testing between each cluster and all other cells, typically using the Wilcoxon rank-sum test by default.The advantage of this method is that it doesnt assume normality but a disadvantage is that it is computationally intensive and is sensitive to clustering quality.
###Marker Gene Analysis Figures
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## 1 0.000000e+00 4.210278 0.561 0.082 0.000000e+00 0 CST4
## 2 0.000000e+00 2.472077 0.464 0.183 0.000000e+00 0 SERPINE2
## 3 0.000000e+00 2.548342 0.313 0.121 0.000000e+00 0 STPG2
## 4 0.000000e+00 2.736698 0.305 0.115 0.000000e+00 0 ARSE
## 5 0.000000e+00 2.383730 0.307 0.140 0.000000e+00 0 KITLG
## 6 0.000000e+00 2.866561 0.790 0.313 0.000000e+00 1 PEG10
## 7 0.000000e+00 3.766502 0.697 0.233 0.000000e+00 1 AFP
## 8 0.000000e+00 2.990434 0.391 0.124 0.000000e+00 1 HBB
## 9 0.000000e+00 3.679762 0.322 0.082 0.000000e+00 1 SLC35F1
## 10 0.000000e+00 3.906108 0.270 0.065 0.000000e+00 1 RPS4Y1
## 11 0.000000e+00 6.867828 0.530 0.006 0.000000e+00 10 MSC-AS1
## 12 0.000000e+00 6.805759 0.362 0.006 0.000000e+00 10 PLA2G5
## 13 0.000000e+00 7.102581 0.355 0.004 0.000000e+00 10 PKNOX2
## 14 0.000000e+00 6.804875 0.362 0.012 0.000000e+00 10 MGAT4C
## 15 0.000000e+00 6.890617 0.316 0.004 0.000000e+00 10 GEM
## 16 0.000000e+00 6.925527 0.665 0.012 0.000000e+00 11 CD247
## 17 0.000000e+00 6.493208 0.576 0.018 0.000000e+00 11 CARD11
## 18 0.000000e+00 6.459907 0.456 0.009 0.000000e+00 11 SLFN12L
## 19 0.000000e+00 6.769585 0.349 0.005 0.000000e+00 11 RUNX3
## 20 0.000000e+00 7.366675 0.327 0.004 0.000000e+00 11 BCL11B
## 21 0.000000e+00 2.446426 0.548 0.112 0.000000e+00 12 DPP10
## 22 0.000000e+00 2.527509 0.470 0.037 0.000000e+00 12 MIR646HG
## 23 0.000000e+00 3.061346 0.434 0.031 0.000000e+00 12 AC114812.2
## 24 0.000000e+00 2.497557 0.359 0.028 0.000000e+00 12 LINC01446
## 25 0.000000e+00 2.663045 0.255 0.009 0.000000e+00 12 TSHZ3
## 26 0.000000e+00 3.637234 0.608 0.036 0.000000e+00 13 GRIK2
## 27 0.000000e+00 3.985827 0.598 0.029 0.000000e+00 13 PLD5
## 28 0.000000e+00 4.118178 0.368 0.008 0.000000e+00 13 VWA2
## 29 0.000000e+00 4.212434 0.317 0.008 0.000000e+00 13 AL591686.2
## 30 0.000000e+00 3.675575 0.273 0.008 0.000000e+00 13 CCDC148-AS1
## 31 0.000000e+00 4.086553 0.814 0.044 0.000000e+00 14 AVIL
## 32 0.000000e+00 4.683535 0.779 0.022 0.000000e+00 14 ABCB5
## 33 0.000000e+00 4.194562 0.694 0.022 0.000000e+00 14 LTBP4
## 34 0.000000e+00 4.127928 0.550 0.020 0.000000e+00 14 ST6GALNAC5
## 35 0.000000e+00 4.457319 0.524 0.013 0.000000e+00 14 GNAS-AS1
## 36 0.000000e+00 7.931420 0.495 0.001 0.000000e+00 15 AL049629.1
## 37 0.000000e+00 7.944490 0.495 0.001 0.000000e+00 15 AC245041.2
## 38 0.000000e+00 7.382536 0.454 0.002 0.000000e+00 15 SLC5A1
## 39 0.000000e+00 7.387989 0.340 0.001 0.000000e+00 15 PRSS22
## 40 0.000000e+00 7.959469 0.289 0.001 0.000000e+00 15 LINC00842
## 41 0.000000e+00 4.403750 0.917 0.027 0.000000e+00 16 AC018467.1
## 42 0.000000e+00 4.179339 0.833 0.017 0.000000e+00 16 LINC01488
## 43 0.000000e+00 4.452957 0.683 0.009 0.000000e+00 16 AP000424.1
## 44 0.000000e+00 4.311117 0.500 0.009 0.000000e+00 16 AC011498.4
## 45 0.000000e+00 4.384808 0.383 0.006 0.000000e+00 16 AP003555.1
## 46 3.361555e-210 6.670709 1.000 0.040 9.745822e-206 17 TAT
## 47 2.422180e-107 5.321169 0.459 0.015 7.022383e-103 17 SAA4
## 48 1.474614e-97 4.985658 0.622 0.031 4.275201e-93 17 CYP4A11
## 49 2.801567e-79 5.268070 0.297 0.009 8.122303e-75 17 APOF
## 50 2.295483e-61 4.969043 0.973 0.143 6.655064e-57 17 ASS1
## 51 0.000000e+00 4.927727 0.971 0.045 0.000000e+00 2 CRP
## 52 0.000000e+00 5.272700 0.851 0.050 0.000000e+00 2 NNMT
## 53 0.000000e+00 5.050027 0.327 0.028 0.000000e+00 2 CYP2C8
## 54 0.000000e+00 5.752337 0.304 0.025 0.000000e+00 2 CYP2A6
## 55 0.000000e+00 5.530231 0.300 0.024 0.000000e+00 2 CYP2A7
## 56 0.000000e+00 3.498518 0.552 0.039 0.000000e+00 3 LINC00470
## 57 0.000000e+00 3.333775 0.559 0.069 0.000000e+00 3 GPC5
## 58 0.000000e+00 3.733413 0.512 0.048 0.000000e+00 3 AC098617.1
## 59 0.000000e+00 3.715704 0.434 0.031 0.000000e+00 3 TMEFF2
## 60 0.000000e+00 3.869333 0.256 0.010 0.000000e+00 3 CROCC2
## 61 0.000000e+00 1.946839 0.852 0.301 0.000000e+00 4 KCNQ1OT1
## 62 0.000000e+00 2.263821 0.606 0.134 0.000000e+00 4 KCNQ1
## 63 0.000000e+00 1.948869 0.437 0.092 0.000000e+00 4 AC022146.2
## 64 0.000000e+00 2.133614 0.305 0.054 0.000000e+00 4 PRDM16
## 65 0.000000e+00 1.957344 0.250 0.046 0.000000e+00 4 RTN4RL1
## 66 0.000000e+00 3.153081 0.401 0.013 0.000000e+00 5 C2orf48
## 67 0.000000e+00 2.669669 0.355 0.015 0.000000e+00 5 IQGAP3
## 68 0.000000e+00 2.722166 0.324 0.013 0.000000e+00 5 CDC45
## 69 0.000000e+00 2.760342 0.300 0.022 0.000000e+00 5 AC007608.1
## 70 0.000000e+00 2.911755 0.267 0.015 0.000000e+00 5 AC079594.2
## 71 0.000000e+00 3.972423 0.850 0.055 0.000000e+00 6 AP002518.2
## 72 0.000000e+00 3.899813 0.711 0.017 0.000000e+00 6 DTX1
## 73 0.000000e+00 4.037068 0.611 0.021 0.000000e+00 6 SLFNL1
## 74 0.000000e+00 4.164398 0.593 0.011 0.000000e+00 6 AC012613.1
## 75 0.000000e+00 4.209566 0.250 0.003 0.000000e+00 6 AC011498.4
## 76 0.000000e+00 4.485375 0.447 0.023 0.000000e+00 7 EPHA6
## 77 0.000000e+00 4.587825 0.426 0.024 0.000000e+00 7 NTM
## 78 0.000000e+00 4.528960 0.365 0.022 0.000000e+00 7 PLD5
## 79 0.000000e+00 4.176892 0.364 0.030 0.000000e+00 7 GRIK2
## 80 0.000000e+00 4.762748 0.251 0.007 0.000000e+00 7 AC019068.1
## 81 0.000000e+00 5.766574 0.469 0.010 0.000000e+00 8 PDE2A
## 82 0.000000e+00 5.310030 0.463 0.013 0.000000e+00 8 TOX2
## 83 0.000000e+00 5.217499 0.460 0.021 0.000000e+00 8 STAB1
## 84 0.000000e+00 5.316983 0.368 0.007 0.000000e+00 8 NOTCH4
## 85 0.000000e+00 5.130017 0.259 0.013 0.000000e+00 8 AC011476.3
## 86 0.000000e+00 5.556088 0.482 0.008 0.000000e+00 9 FGD2
## 87 0.000000e+00 6.078418 0.421 0.008 0.000000e+00 9 IGSF21
## 88 0.000000e+00 5.372153 0.370 0.006 0.000000e+00 9 WDFY4
## 89 0.000000e+00 5.836110 0.348 0.006 0.000000e+00 9 TRPM2
## 90 0.000000e+00 5.557584 0.319 0.004 0.000000e+00 9 CYTH4
I used SingleR to perform do automatic annotation of cell labels. The algorithm uses reference transcriptomic datasets of pure cell types to infer the cell origin. Based on the figure below most of the cell identities appear to be Hepatocytes and erythroblasts which are cell types found in the liver. This makes sense as the original publication is about identifying tumor cell populations in the hepatoblastoma environment.
Citation: Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR, Butte AJ, Bhattacharya M (2019). “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol., 20, 163-172. doi:10.1038/s41590-018-0276-y.
## Manual Gene
Annotation Sources 0:Uhlén M et al., Tissue-based map of the human
proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419
1: AFP Sauzay, Chloé, et al. “Alpha-foetoprotein (AFP): A multi-purpose
marker in hepatocellular carcinoma.” Clinica chimica acta 463 (2016):
39-44. 2: CYP2C8 Uhlén M et al., Tissue-based map of the human proteome.
Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 3: GCP5
Bosworth AP, Contreras M, Sancho L, Salas IH, Paumier A, Novak SW, Manor
U, Allen NJ. Astrocyte glypican 5 regulates synapse maturation and
stabilization. Cell Rep. 2025 Mar 25;44(3):115374. doi:
10.1016/j.celrep.2025.115374. Epub 2025 Mar 5. PMID: 40048429; PMCID:
PMC12013928. 4: RTN4RL1 Uhlén M et al., Tissue-based map of the human
proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419
5: C2orf48 IQGAP3 CDC45 Uhlén M et al., Tissue-based map of the human
proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419
6: DTX4 Wang, Zonggui, et al. “E3 ubiquitin ligase DTX4 is required for
adipogenic differentiation in 3T3-L1 preadipocytes cell line.”
Biochemical and Biophysical Research Communications 492.3 (2017):
419-424. 7: PLD5 Uhlén M et al., Tissue-based map of the human proteome.
Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 8: NOTCH4
Lai, Peng-Yeh, Chong-Bin Tsai, and Min-Jen Tseng. “Active form Notch4
promotes the proliferation and differentiation of 3T3-L1 preadipocytes.”
Biochemical and biophysical research communications 430.3 (2013):
1132-1139. 9: FGD2 Huber, Christoph, et al. “FGD2, a CDC42-specific
exchange factor expressed by antigen-presenting cells, localizes to
early endosomes and active membrane ruffles.” Journal of Biological
Chemistry 283.49 (2008): 34002-34012. 10:PLA2G5 Uhlén M et al.,
Tissue-based map of the human proteome. Science (2015) PubMed: 25613900
DOI: 10.1126/science.1260419 11: CD247 Eldor, Roy, et al. “CD247, a
Novel T Cell–Derived Diagnostic and Prognostic Biomarker for Detecting
Disease Progression and Severity in Patients With Type 2 Diabetes.”
Diabetes care 38.1 (2015): 113-118. 12:Caubit, Xavier, et al. “TSHZ3
deletion causes an autism syndrome and defects in cortical projection
neurons.” Nature genetics 48.11 (2016): 1359-1369.
My strategy for naming the clusters was starting with the Human Protein Atlas which is cited for several of the clusters above. I did a search for each of the five top genes in the cluster in the Human Protein Atlas and specifically looked for cell type expression cluster and tissue specificity in the search. If there was a hit for the gene (not all genes were found) and some some enrichement in a specific cell type or tissue, I did a second search of the gene in PubMed to confirm if the two terms appeared together. I then went through the rest of the two 5 genes in order to get a sense of a consensus. If I were doing this for a publication I would probably cite a paper for each gene that I searched up which I did not do in this case - I either cited the Human Protein Atlas or a publication with the one of the genes.
Below is Figure 3B replicated. This figure highlights marker genes used to identify the common cell population in liver and tumor samples. Although the shapes of our UMAPs differ - the general down regulation and upregulation of the markers across the plot between the publication and my reproduction of the figure is somewhat consistent.
Below is Figure 4A
replicated
This figure plots mean normalized expression values across the three types of cells that are in this analysis: background hepatocytes, tumor cells from primary hepatoblastoma (HB), and tumor cells from patient-derived xenograft (PDX) models. This involved, mean-averaging gene expression per cell type, normalizing to remove baseline shifts, comparing expression trends across types and then quantifying the correlation with R2.The R squared values between the publication and my findings here are different but for PDX tumor cells and tumor cells the highest correlation remains with the published R squared at 0.902 and mine at 0.899 which is reassuring.
I performed Pseudotime analysis using Monocle3 to infer the transcriptional trajectory from background hepatocytes to tumor and PDX-derived cells. This approach can model how cellular identities transition over time and assess whether PDX cells faithfully recapitulate the progression observed in primary tumors. Additionally, by ordering cells along this trajectory it may be possible to identify regulated genes and potential regulators of tumor transformation.